
Whitespace and comment tokens #387

Merged
dlang-bot merged 5 commits into dlang-community:master from WebFreak001:whitespace-and-comment-tokens
Apr 9, 2020

Conversation

@WebFreak001
Member

WebFreak001 commented Apr 8, 2020

This replaces the previous comment and trailingComment properties. It maintains full backwards compatibility; only some @nogc code may now fail to compile if it used the comment or trailingComment properties from Token.

Creates a new trivia.d file, which contains helpers for dealing with trivia, and moves the previous comment helper from lexer.d into this file.

Also simplifies the getTokensForParser implementation.

I suggest reviewing each commit separately; they are strictly separated by functionality.

Commit "add unittest for current comment behavior" (15e117a) just adds unit tests for the previous behavior; they pass as-is.
Commit "separate comment helpers into dparse.trivia" (afe19b2) moves the comment helper structs to dparse.trivia and also copies an internal function and enum that were previously inside the getTokensForParser function.
Commit "make token attributes not apply to extraFields" (4d93669) moves the extraFields mixin outside the @safe nothrow pure @nogc block so it can decide on the attributes itself; this may introduce changes in other software using std.experimental.lexer, as it weakens attributes.
Commit "attach whitespace & comment tokens to tokens" (db2a669) does the actual changes: it removes comment parsing from the getTokensForParser function and instead appends trivia (almost the same logic, just simpler), adds compatibility getters to the Token struct, and unit-tests a big block of trivia-heavy code (see the usage sketch below).

Supersedes #149
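For illustration, a minimal usage sketch of the new trivia attachment from the consumer side. The leadingTrivia field and the comment compatibility getter are the ones described above; the rest is ordinary libdparse lexing boilerplate, not code from this PR:

```d
// Minimal usage sketch (assumptions: the leadingTrivia/trailingTrivia fields
// and the comment compatibility getter described in this PR).
import std.stdio : writeln;
import dparse.lexer;

void main()
{
    string source = "// a leading comment\nint x; // a trailing comment\n";
    StringCache cache = StringCache(StringCache.defaultBucketCount);
    LexerConfig config;
    auto tokens = getTokensForParser(cast(ubyte[]) source, config, &cache);

    foreach (token; tokens)
    {
        // New: whitespace and comment tokens are attached to each token as trivia.
        foreach (trivia; token.leadingTrivia)
            writeln("leading trivia: ", trivia.text);

        // Old: the comment property still works, but is now a (non-@nogc) getter.
        if (token.comment.length)
            writeln("comment: ", token.comment);
    }
}
```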

For us this only means the @nogc attribute is no longer applied there. This allows more control within TokenStructure aliases, for example to create GC-dependent functions.
WebFreak001 force-pushed the whitespace-and-comment-tokens branch from 081566a to db2a669 on April 8, 2020 14:02
@WebFreak001
Member Author

Also a nice change that was introduced alongside: error messages containing the Token type are now much more readable:
src/dparse/trivia.d(516,5): Error: mutable method dparse.trivia.extractTrailingDdoc.FilterResult!(__lambda2, const(TokenStructure!(ubyte, "import dparse.lexer:TokenTriviaFields; mixin TokenTriviaFields;"))[]).FilterResult.empty is not callable using a const object
vs
src/dparse/trivia.d(516,5): Error: mutable method dparse.trivia.extractTrailingDdoc.FilterResult!(__lambda2, const(TokenStructure!(ubyte, "\x0a string comment;\x0a string trailingComment;\x0a\x0a int opCmp(size_t i) const pure nothrow @safe {\x0a if (index < i) return -1;\x0a if (index > i) return 1;\x0a return 0;\x0a }\x0a\x0a int opCmp(ref const typeof(this) other)const pure nothrow @safe {\x0a return opCmp(other.index);\x0a }\x0a"))[]).FilterResult.empty is not callable using a const object

@Hackerpilot
Collaborator

Hackerpilot commented Apr 9, 2020

I think the dparse_src variable in meson.build needs to be updated.

@codecov

codecov bot commented Apr 9, 2020

Codecov Report

Merging #387 into master will increase coverage by 0.55%.
The diff coverage is 99.23%.


@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   75.53%   76.09%   +0.55%     
==========================================
  Files           8        9       +1     
  Lines        6985     7123     +138     
==========================================
+ Hits         5276     5420     +144     
+ Misses       1709     1703       -6     
Impacted Files Coverage Δ
src/std/experimental/lexer.d 84.78% <90.00%> (ø)
src/dparse/trivia.d 99.40% <99.40%> (ø)
src/dparse/lexer.d 83.40% <100.00%> (-2.15%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c6c21d...14d5e3c.

@Hackerpilot
Collaborator

Hackerpilot commented Apr 9, 2020

When we tag a release with these changes we should change the version number to v0.15.0 or v0.15.0-beta.1.

@RazvanN7
Contributor

RazvanN7 commented Jun 15, 2021

I was trying to update phobos to use the latest dscanner [1], which in turn was updated to use the latest libdparse, and I stumbled into the same failure as @MoonlightSentinel. I lost a few hours investigating the failure and ended up at this same PR.

How can we fix this issue? It is currently blocking the phobos PR queue.

Also, we should be testing new additions against phobos (via dscanner) just to make sure that future additions do not break it.

[1] dlang/phobos#8136

@RazvanN7
Contributor

@WebFreak001 any ideas for a fix?

@Geod24
Member

Geod24 commented Jun 15, 2021

So the exception itself is legit, because for some reason one of the comments is garbage:

(gdb) frame 10
#10 0x000055a2e48a1ce8 in _D8dscanner8analysis18has_public_example21HasPublicExampleCheck__T8hasDittoTC6dparse3ast19FunctionDeclarationZQBsMFxCQBoQBkQBjZb (this=0x7f8fe5a7a200, decl=0x55a2e58e4da0)
    at src/dscanner/analysis/has_public_example.d:97
97			return parseComment(decl.comment, null).isDitto;
(gdb) p *decl
$2 = {hasAuto = false, hasRef = false, returnType = 0x0, name = {text = "castSwitch", line = 240, column = 6,
    index = 7400, type = 65 'A', leadingTrivia = 0x0, trailingTrivia = 0x0, memoizedLeadingComment = 0x0,
    memoizedTrailingComment = 0x0}, templateParameters = 0x55a2e58e4f80, parameters = 0x55a2e58e50f0,
  constraint = 0x0, functionBody = 0x55a2e58e5410, memberFunctionAttributes = 0x0,
  comment = "\343,\000\000\345,\000\000\354,\000\000\355,\000\000\356,\000\000\357,\000\000\363,\000\000\364,\000\000\000-\000\000&-\000\000'-\000\000(-\000\000--\000\000.-\000\000 \246\000\000*\246\000\000A\246\000\000B\246\000\000C\246\000\000D\246\000\000E\246\000\000F\246\000\000G\246\000\000H\246\000\000I\246\000\000J\246\000\000K\246\000\000L\246\000\000M\246\000\000N\246\000\000O\246\000\000P\246\000\000Q\246\000\000R\246\000\000S\246\000\000T\246\000\000U\246\000\000V\246\000\000W\246\000\000X\246\000\000Y\246\000\000Z\246\000\000[\246\000\000\\\246\000\000]\246\000\000^\246\000\000_\246\000\000`\246\000\000a\246\000\000b\246\000\000"..., attributes = 0x0, storageClasses = {0x55a2e58e4eb0}}

Removing code / comments makes the location move. It reeks of memory corruption.
Dumping the data to binary gives:

[227, 44, 0, 0, 229, 44, 0, 0, 236, 44, 0, 0, 237, 44, 0, 0, 238, 44, 0, 0, 239, 44, 0, 0, 243, 44, 0, 0, 244, 44, 0, 0, 0, 45, 0, 0, 38, 45, 0, 0, 39, 45, 0, 0, 40, 45, 0, 0, 45, 45, 0, 0, 46, 45, 0, 0, 32, 166, 0, 0, 42, 166, 0, 0, 65, 166, 0, 0, 66, 166, 0, 0, 67, 166, 0, 0, 68, 166, 0, 0, 69, 166, 0, 0, 70, 166, 0, 0, 71, 166, 0, 0, 72, 166, 0, 0, 73, 166, 0, 0, 74, 166, 0, 0, 75, 166, 0, 0, 76, 166, 0, 0, 77, 166, 0, 0, 78, 166, 0, 0, 79, 166, 0, 0, 80, 166, 0, 0, 81, 166, 0, 0, 82, 166, 0, 0, 83, 166, 0, 0, 84, 166, 0, 0, 85, 166, 0, 0, 86, 166, 0, 0, 87, 166, 0, 0, 88, 166, 0, 0, 89, 166, 0, 0, 90, 166, 0, 0, 91, 166, 0, 0, 92, 166, 0, 0, 93, 166, 0, 0, 94, 166, 0, 0, 95, 166, 0, 0, 96, 166, 0, 0, 97, 166, 0, 0, 98, 166, 0, 0, 99, 166, 0, 0, 100, 166, 0, 0, 101, 166, 0, 0, 102, 166, 0, 0, 103, 166, 0, 0, 104, 166, 0, 0, 105, 166, 0, 0, 106, 166, 0, 0, 107, 166, 0, 0, 108, 166, 0, 0, 109, .... ]

Which is either a 4-byte pattern or uints:

[11491, 11493, 11500, 11501, 11502, 11503, 11507, 11508, 11520, 11558, 11559, 11560, 11565, 11566, 42528, 42538, 42561, 42562, 42563, 42564, 42565, 42566, 42567, 42568, 42569, 42570, 42571, 42572, 42573, 42574, 42575, 42576, 42577, 42578, 42579, 42580, 42581, 42582, 42583, 42584, 42585, 42586, 42587, 42588, 42589, 42590, 42591, 42592, 42593, 42594, 42595, 42596, 42597, 42598, 42599, 42600, 42601, 42602, 42603, 42604, 42605, 42606, 42625, 42626, 42627, 42628, 42629, 42630, 42631, 42632, 42633, 42634, 42635, 42636, 42637, 42638, 42639, 42640, 42641, 42642, 42643, 42644, 42645, 42646, 42647, 42648, 42787, 42788, 42789, 42790, 42791, 42792, 42793, 42794, 42795, 42796, 42797, 42798, 42799, 42802, 42803, 42804, 42805, 42806, 42807, 42808, 42809, 42810, 42811, 42812, 42813, 42814, 42815, 42816, 42817, 42818, 42819, 42820, 42821, 42822, 42823, 42824, 42825, 42826, 42827, 42828, 42829, 42830, 42831, 42832, 42833, 42834, 42835, 42836, 42837, 42838, 42839, 42840, 42841, 42842, 42843, 42844, 42845, 42846, 42847, 42848, 42849, 42850, 42851, 42852, 42853, 42854, 42855, 42856, 42857, 42858, 42859, 42860, 42861, 42862, 42863, 42864, 42865, 42873, 42874, 42875, 42876, 42877, 42879, 42880, 42881, 42882, 42883, 42884, 42885, 42886, 42887, 42888, 42892, 42893, 42894, 42895, 42897, 42898, 42899, 42900, 42913, 42914, 42915, 42916, 42917, 42918, 42919, 42920, 42921, 42922, 43002, 43003, 43216, 43226, 43264, 43274, 43472, 43482, 43600, 43610, 44016, 44026, 64256, 64263, 64275, 64280, 65296, 65306, 65345, 65371, 66600, 66640, 66720, 66730, 69734, 69744, 69872, 69882, 69942, 69952, 70096, 70106, 71360, 71370, 119834, 119860, 119886, 119893, 119894, 119912, 119938, 119964, 119990, 119994, 119995, 119996, 119997, 120004, 120005, 120016, 120042, 120068, 120094, 120120, 120146, 120172, 120198, 120224, 120250, 120276, 120302, 120328, 120354, 120380, 120406, 120432, 120458, 120486, 120514, 120539, 120540, 120546, 120572, 120597, 120598]

This doesn't evoke anything obvious to me (perhaps indexes in the file?) but there is a clear pattern.
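For reference, the uint listing above is just the same byte dump read as little-endian 32-bit values; a minimal sketch of that reinterpretation (sample data truncated):

```d
// Minimal sketch: reinterpret the dumped bytes as little-endian uints,
// which is how the second listing above was produced (sample data truncated).
import std.stdio : writeln;

void main()
{
    ubyte[] raw = [227, 44, 0, 0, 229, 44, 0, 0, 236, 44, 0, 0, 237, 44, 0, 0];
    // On a little-endian machine, a plain array cast gives the uint view.
    auto values = cast(const(uint)[]) raw;
    writeln(values); // [11491, 11493, 11500, 11501]
}
```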

BTW, while looking at the code I noticed this change:
db2a669#diff-073a681f37b7ac58a75cb15be50882a5075a2ebcaa53efc4bd97cffa037ab6ddR472-R475
Notably, the char[] => Token[] change. The data is duped, but Token has indirections (it contains strings); if any of those strings is not properly allocated, it could be a very bad idea. It might be safe, but since I was looking for memory corruption, this one stood out.

@RazvanN7
Contributor

@WebFreak001 could you provide some assistance?

@WebFreak001
Member Author

I think what Mathias said is already going to be the issue. As this only happens with huge files, it's probably either something with slices of data that has moved (the source code moved and the appender didn't dup it), or something like a previous issue in DCD with a splitting list data structure that was fundamentally flawed somehow, but had such a big threshold that it was never observed in smaller files.

@WebFreak001
Member Author

WebFreak001 commented Jun 17, 2021

I have tried the zip here and a complete phobos checkout using dub -- -S phobos with ~master D-Scanner and ~master libdparse; however, I could not reproduce a crash on my machine.

Also tried compiling D-Scanner with LDC in a release build.

Also tried make.

Got the issue now; I had forgotten the --config parameter.

WebFreak001 deleted the whitespace-and-comment-tokens branch on June 17, 2021 16:53
@WebFreak001
Member Author

WebFreak001 commented Jun 17, 2021

OK, so the strings of the trivia parts are all fine in memory; it is the combined cached string that gets overwritten by the GC (in my case, std.regex was doing some allocation and the GC gave it part of that memory), which causes the error.

When the comment (string) is first assigned in parseFunctionDeclaration in parser.d, the content is correct; later, however, the same string data pointer is no longer correct and causes the UTFException.

To me this seems like the GC either loses track of this string (thinking it's not allocated) or it's wrongly allocated.

[screenshot: the GC overwrote the comment content]

@WebFreak001
Member Author

I found one issue by accident just now:

The comment bytes are cast to string without interning (because of the config.commentBehavior = CommentBehavior.noIntern)

Maybe we shouldn't override what the user provides? Although this should have happened before as well; it seems the increased memory usage might now be the reason the code gets freed when it wasn't before.

This is not a fix for the issue, but rather prevents problems from appearing when trying to access comments later. It seems to have made the original issue trigger even earlier (on an earlier token) during analysis, indicating that how much memory is allocated might be relevant.

The weird thing is that it is the string returned by appender!string that gets corrupted, which shouldn't happen?

Fun fact:
calling GC.addRoot(&ret.data[0]); on the appender data causes the issue to go away.
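For context, a minimal sketch of that GC.addRoot workaround. The surrounding function is a placeholder, not the actual libdparse code; note that it pins the buffer permanently, so it leaks:

```d
// Minimal sketch of the GC.addRoot workaround mentioned above; buildComment
// is a placeholder, not the actual libdparse implementation.
import core.memory : GC;
import std.array : appender;

string buildComment(const(string)[] parts)
{
    auto ret = appender!string();
    foreach (part; parts)
        ret.put(part);

    auto result = ret.data;
    if (result.length)
        GC.addRoot(&result[0]); // pin the buffer; never released here, so it leaks
    return result;
}
```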

@WebFreak001
Member Author

Maybe the comment generation should use the normal StringCache, and the comments should be generated up front instead of being computed lazily when needed?

@Geod24
Member

Geod24 commented Jun 26, 2021

calling GC.addRoot(&ret.data[0]); on the appender data causes the issue to go away.

Could it be that the pointer for the data is not aligned, causing the GC to miss it?

@WebFreak001
Member Author

The data is allocated by appender, so the address should be aligned.

The string itself, holding the pointer to that data, is inside a Token structure, which doesn't do any special alignment. Here are the definitions, however:

mixin (extraFields);

string memoizedLeadingComment = null;

These tokens are themselves in an array, allocated by an appender in

auto leadingTriviaAppender = appender!(Token[])();

I don't know; does appender break the alignment? Otherwise it would all be aligned properly.

So the data is stored like this:

  • list of tokens: Token[] (appender allocated)
    • list of trivia per token: Token[] (appender allocated)
      • leading/trailing comment: string (interned, not corrupted on building)
    • memoized combined & processed leading/trailing comment: string (appender allocated, first not corrupted - later corrupted)
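As a simplified sketch of that layout (field names follow the gdb dump earlier in the thread; all other fields are omitted):

```d
// Simplified sketch of the layout described above; field names follow the
// gdb dump earlier in the thread, and all other fields are omitted.
struct Token
{
    string text;                    // interned token text (fine in memory)
    Token[] leadingTrivia;          // whitespace/comment tokens before this token
    Token[] trailingTrivia;         // whitespace/comment tokens after this token
    string memoizedLeadingComment;  // built lazily via appender (the one that corrupts)
    string memoizedTrailingComment;
}
```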

@RazvanN7
Contributor

RazvanN7 commented Jul 5, 2021

Until we get to the bottom of this, would it make sense to revert this PR and update Dscanner so that the phobos AliasAssign PRs get unblocked?

@WebFreak001
Member Author

I think it might be better to leave this API in, as it has been in for three releases now already.

As a workaround it might be possible to call internString on the appender-allocated string to make it actually permanent in memory.
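A hedged sketch of that workaround, assuming a StringCache is reachable where the memoized comment is built (the function is a placeholder, not the actual libdparse code):

```d
// Hedged sketch of the interning workaround suggested above; buildMemoizedComment
// is a placeholder, and it assumes a StringCache is available at that point.
import dparse.lexer;
import std.array : appender;

string buildMemoizedComment(const(string)[] parts, ref StringCache cache)
{
    auto ret = appender!string();
    foreach (part; parts)
        ret.put(part);
    // Copy the appender-backed buffer into the string cache so the result is
    // owned by memory the GC will not hand out again.
    return cache.intern(ret.data);
}
```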

@RazvanN7
Contributor

RazvanN7 commented Jul 8, 2021

Would setting config.commentBehavior to CommentBehavior.intern work around the issue?

@WebFreak001
Member Author

WebFreak001 commented Jul 8, 2021

The comments of the tokens are not getting corrupted; they stay fine since they are already interned (I also ran the tests using this option). The thing that gets corrupted regardless is the memoized comment/trailingComment in https://github.com/dlang-community/libdparse/pull/387/files#diff-073a681f37b7ac58a75cb15be50882a5075a2ebcaa53efc4bd97cffa037ab6ddR146, which is created using an appender.

When the string is assigned to another variable in parseFunctionDeclaration in parser.d, it's still valid; however, a little later it gets corrupted.

@CyberShadow
Member

This commit introduced a regression manifesting in D-Scanner:

Fix: #445
